What's inside the news?

Franziska Löw

December 16, 2017

Agenda

  1. Introduction

  2. Methodology

  3. First Results

  4. What's next

Introduction

Online News

Business Model of Online News

  • Does the business model have an effect on the editorial content?
  • Methodology:
  1. Estimate a Structural Topic Model
  2. Use the posterior distribution to estimate the effect of document metadata.

Data

Online news articles about domestic politics, published between 1 June 2017 and 1 December 2017

Concepts

  • A single observation in a textual database is called a document.

  • The set of documents that make up the dataset is called a corpus.

  • Covariates associated with each document are called metadata.

Data Structure

doc   afd  beruf  cdu  funktion  helmut  jahre  kohl  merkel  prozent  schulz
4214    2      0   13         0       0      4     0     106       13      62
5412  137     69   22        69       0     78     0       3        3       3
5442  134      0   71         0       0      4     0      59       88      32
582     0      0   19         0     144     28   235       3        1       0
6282  206     94   27        94       0    104     0       5        2       3
  • Documents (articles) are stored in a document-term matrix: one row per document, one column per term, each cell holding a term count.

  • Each document is treated as a “bag of words”: word order is ignored and only term counts are kept.

  • Each article has metadata: publisher (news platform) and the day it was published.
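
As a minimal sketch of the data structure described above, the following Python snippet builds a document-term matrix from raw text. The function name `build_dtm`, the whitespace tokenization, and the two toy documents are illustrative assumptions, not the preprocessing used for the news corpus.

```python
from collections import Counter

def build_dtm(docs):
    """Return (vocabulary, matrix) where matrix[i][j] counts term j in document i."""
    # Bag of words: word order is discarded, only per-document counts are kept.
    counts = [Counter(text.lower().split()) for text in docs]
    vocab = sorted(set().union(*counts))
    matrix = [[c[term] for term in vocab] for c in counts]
    return vocab, matrix

# Hypothetical toy corpus (not taken from the news data).
docs = ["merkel schulz merkel", "afd cdu merkel"]
vocab, dtm = build_dtm(docs)
for row in [vocab] + dtm:
    print(row)
```

Each row of `dtm` corresponds to one document and each column to one term of the sorted vocabulary, mirroring the layout of the table above.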

How can we discover the latent topics in an article?

Methodology

Topic Model

Credits: Christine Doig

The intuition behind LDA

Credits: Blei (2012)

LDA as a graphical model

  • Nodes are random variables; arrows indicate dependence
  • Plates indicate replicated variables:
    • \(N =\) collection of words within a document.
    • \(D =\) collection of documents within a corpus.
  • Shaded nodes are observed; unshaded nodes are hidden
    • observed: word in a document \(w_{d,n}\)
    • fixed: mixture components (number of topics \(K\) & vocabulary)
    • hidden: mixture proportions (per-document topic proportions \(\theta_d\) & word-topic distribution \(\beta_k\))

LDA - joint distribution

\[ \begin{aligned} p(\beta_{1:K},\theta_{1:D},z_{1:D}, w_{1:D}) = \prod_{i=1}^{K}p(\beta_i) \prod_{d=1}^{D}p(\theta_d)\left(\prod_{n=1}^{N}p(z_{d,n}|\theta_d)\,p(w_{d,n}|\beta_{1:K},z_{d,n})\right) \end{aligned} \]

Generative process

  • \(K\): choose the number of topics

    • \(K=3\)
  • \(\theta_d\): for each document \(d\), choose a distribution over topics;

    • \(\theta_d\) ~ Dirichlet(\(\alpha\))
  • \(z_{d,n}|\theta_d\): according to \(\theta_d\), assign a topic \(z_{d,n}\) for the \(n^{th}\) word;

    • \(z_{d,n} = \text{Topic 1}\)

Generative process (contd.)

  • \(w_{d,n}|z_{d,n},\beta\): choose a term from the assigned topic \(k = z_{d,n}\) according to \(\beta_k\)

    • \(\beta_k\) ~ Dirichlet(\(\eta\))

  • \(N\): repeat this process for all \(n\) word positions in the document.

  • \(D\): conduct this process for all \(d\) documents in the corpus
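
The generative process can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the estimation code behind the results: the function names, the symmetric scalar hyperparameters `alpha` and `eta`, the fixed document length, and the toy vocabulary are all hypothetical. Dirichlet draws are obtained by normalizing independent Gamma draws.

```python
import random

def sample_dirichlet(alpha, rng):
    """Draw from a Dirichlet by normalizing independent Gamma(alpha_i, 1) draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [x / total for x in draws]

def generate_corpus(n_docs, n_words, vocab, n_topics, alpha, eta, seed=0):
    rng = random.Random(seed)
    # beta_k ~ Dirichlet(eta): one distribution over the vocabulary per topic.
    beta = [sample_dirichlet([eta] * len(vocab), rng) for _ in range(n_topics)]
    corpus = []
    for _ in range(n_docs):
        # theta_d ~ Dirichlet(alpha): topic proportions of document d.
        theta = sample_dirichlet([alpha] * n_topics, rng)
        words = []
        for _ in range(n_words):
            # z_{d,n} | theta_d: topic assignment for word position n.
            z = rng.choices(range(n_topics), weights=theta)[0]
            # w_{d,n} | z_{d,n}, beta: draw a term from the assigned topic.
            words.append(rng.choices(vocab, weights=beta[z])[0])
        corpus.append(words)
    return beta, corpus

# Hypothetical toy vocabulary and settings.
vocab = ["afd", "cdu", "merkel", "schulz", "spd"]
beta, corpus = generate_corpus(n_docs=2, n_words=8, vocab=vocab,
                               n_topics=3, alpha=0.5, eta=0.5)
print(corpus[0])
```

Inference runs this process in reverse: only the words are observed, and \(\theta_d\), \(z_{d,n}\) and \(\beta_k\) have to be recovered from them.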

Research Process

Structural Topic Model (Roberts et al., 2016)

  • Including covariates into a topic model:
  1. Topic Prevalence: Attributes that affect the likelihood of discussing topic \(k\)

    News platform & the month the article was published.
  2. Topic Content: Attributes that affect the likelihood of including term \(w\) overall, and of including it within topic \(k\)

    Not yet included.
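
To illustrate how prevalence covariates enter the model, the sketch below assumes the logistic-normal form of the Structural Topic Model, in which \(\eta_d\) is drawn from a Normal distribution with mean \(X_d\gamma\) and \(\theta_d\) is the softmax of \(\eta_d\). The diagonal covariance, the coefficient values, and the single platform dummy are simplifying assumptions chosen for illustration only.

```python
import math
import random

def softmax(values):
    """Map real-valued scores to proportions that sum to one."""
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def sample_topic_proportions(x_d, gamma, sigma, rng):
    """Draw theta_d = softmax(eta_d) with eta_d ~ Normal(x_d * gamma, diag(sigma^2))."""
    n_topics = len(gamma[0])
    # Mean of eta_d for each topic k: sum over covariates p of x_d[p] * gamma[p][k].
    mean = [sum(x * row[k] for x, row in zip(x_d, gamma)) for k in range(n_topics)]
    # Assumption: diagonal covariance, so each component is drawn independently.
    eta = [rng.gauss(mu, s) for mu, s in zip(mean, sigma)]
    return softmax(eta)

rng = random.Random(1)
# Hypothetical coefficients: an intercept row and one news-platform dummy.
gamma = [[0.0, 0.0, 0.0],    # intercept
         [2.0, 0.0, -2.0]]   # platform effect on each of 3 topics
sigma = [0.1, 0.1, 0.1]
theta_other = sample_topic_proportions([1.0, 0.0], gamma, sigma, rng)
theta_platform = sample_topic_proportions([1.0, 1.0], gamma, sigma, rng)
print(theta_other)
print(theta_platform)
```

With these made-up coefficients, articles from the flagged platform put more weight on the first topic and less on the third, which is the kind of shift the prevalence covariates are meant to capture.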

STM Priors

Bayesian inference

Posterior probability:

\[ \begin{aligned} p(\eta,z,\kappa,\gamma,\Sigma|w,X,Y) \propto \prod_{d=1}^{D} p(\eta_d|X_d\gamma,\Sigma) \left( \prod_{n=1}^{N} p(z_{n,d}|\theta_d) \, p(w_n|\beta_{d,k=z_{d,n}}) \right) \times \prod p(\kappa) \prod p(\Gamma) \end{aligned} \]

  • The number of possible topic structures is exponentially large, so the sum over them is intractable to compute.
  • Instead of obtaining a closed-form solution for the posterior distribution, we must approximate it.

Approximate Posterior

Central research goal of probabilistic modeling: develop efficient methods for approximating the posterior.

  • Mean field variational methods (Blei et al., 2001, 2003)
  • Expectation propagation (Minka and Lafferty, 2002)
  • Collapsed Gibbs sampling (Griffiths and Steyvers, 2002)
  • Distributed sampling (Newman et al., 2008; Ahmed et al., 2012)
  • Stochastic inference (Hoffman et al., 2010, 2013; Mimno et al., 2012)
  • Factorization inference (Arora et al., 2012; Anandkumar et al., 2012)
  • Variational EM algorithm (Wang and Blei, 2013; Roberts et al., 2016)

Model Selection

Prior Specification:

\(\gamma_{p,k}\) ~ Normal\((0,\sigma^2_k)\)

\(\sigma^2_k\) ~ Inverse-Gamma\((a,b)\)

Topic Selection:

\(K=28\)

Model Results

Topic Proportions

Timeline

Sample Articles

Topic (top terms), followed by sample article titles:

Topic 01 - spd schulz martin nahles gabriel
  • SPD-Fraktion - Oppermann als Bundestagsvize-Kandidat gewählt
  • Wahlverlierer Schulz: Wer stützt ihn eigentlich noch in der SPD?
  • SPD-Vize Scholz will Mindestlohn auf zwölf Euro anheben

Topic 05 - merkel schulz kanzlerin duell
  • Angela Merkel: Schulz forderte zweites TV-Duell – Kanzlerin lehnt ab - FOCUS Online
  • Bundestagswahl: Merkel lehnt zweites TV-Duell ab | ZEIT ONLINE
  • Am Handy: Wie Jean-Claude Juncker Angela Merkel mit seiner Frau verwechselte

Topic 06 - grünen csu cdu familiennachzug obergrenze jamaika
  • Was Union, FDP und Grüne programmatisch trennt und eint
  • 50.000 und 750.000?: Über die Zahlen zum Familiennachzug streiten die Sondierer am meisten - WELT
  • Grüne stellen Jamaika-Zwischenergebnis bei Finanzen infrage

Topic 07 - trump merkel us donald macron
  • Atom-Konflikt: Kim Jong Un nennt Trump “geisteskranken, dementen Greis”
  • Rückzug der USA: Jetzt erst recht: Bundeskanzlerin will stärker für Klimaschutz kämpfen
  • Angela Merkel will an Pariser Klimaabkommen festhalten

Sample Articles (contd.)

Topic (top terms), followed by sample article titles:

Topic 09 - hamburg polizei gipfel scholz gipfels
  • G20-Gipfel in Hamburg: Randalierer setzen Autos in Brand
  • Streit um Polizeieinsatz: G20-Camps zum Schlafen bleiben verboten
  • Sitzblockaden geräumt: Nach G20-Ende: Neue Zusammenstöße in Hamburg

Topic 16 - fdp jamaika grünen sondierungen neuwahlen
  • FDP und Grüne sondieren für Jamaika
  • Angela Merkel zu Jamaika: “Wir übernehmen Verantwortung für dieses Land in schwierigen Stunden” - WELT
  • Jamaika-Parteien bringen erstes Gespräch hinter sich - bald könnte es knirschen

Topic 19 - afd gauland weidel
  • AfD-Wahlplakat mit Schweizer Matterhorn sorgt für Lacher – Partei wehrt sich - FOCUS Online
  • “PARTEI” kapert AfD-Gruppen: “Von nun an von echten Menschen verarscht” | faktenfinder.tagesschau.de
  • Abfrage von sensiblen Daten: AfD lenkt nach Kritik an Parteitags-Anmeldung ein

Topic 24 - afd petry partei frauke
  • Bundesvorstand: Nimmt Björn Höcke jetzt Kurs auf AfD-Spitzenposition? - WELT
  • Frauke Petry: Ausschuss empfiehlt Aufhebung der Immunität von AfD-Politikerin
  • Berliner AfD: Beatrix von Storch verliert Führungsposten

Estimate Effect of Covariates

Estimate the conditional expectation of topic prevalence for given document characteristics (a linear model fitted to compositional data):

\[ \theta_d=\alpha+\beta_1 x_{site}+\beta_2x_{month}+\epsilon \]
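
A minimal sketch of such a regression, reduced to a single topic and a single platform dummy, is shown below. The helper `ols_simple` and the six data points are hypothetical. The sketch treats the topic proportions as known, so unlike an estimate based on the full posterior distribution it ignores the uncertainty in \(\theta_d\).

```python
def ols_simple(x, y):
    """Ordinary least squares for y = a + b * x with a single regressor."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope

# Hypothetical data: proportion of one topic in six articles, with a dummy
# that is 1 for articles from one platform and 0 for all others.
site = [0, 0, 0, 1, 1, 1]
theta_k = [0.10, 0.12, 0.08, 0.30, 0.28, 0.32]
intercept, slope = ols_simple(site, theta_k)
print(round(intercept, 3), round(slope, 3))
```

With a single dummy regressor, the intercept is the mean topic proportion in the reference group and the slope is the difference in means between the two groups.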

What's next?

  1. Estimate the model including the effect of covariates on topical content: how are topics discussed across different newswires?

  2. Examine the relationship between topics: how topic correlations differ across newswires, indicating how each newswire connects and frames topics.